One of the most important steps in the history of Python was probably the release of Python 3.0. The most notable changes that happened in that release were:
As we know from Chapter 1, Current Status of Python, Python 3 isn't backward-compatible with Python 2. This is the main reason why it took so many years for the Python community to fully embrace it. That was a tough, albeit necessary, lesson for Python core developers and the Python community.
Fortunately, problems associated with the adoption of Python 3 didn't stop the process of language evolution. Since December 3, 2008 (the official release of Python 3.0), we've seen a steady stream of new major Python updates. Every new release has brought improvements to the language, its standard library, and its interpreter. Moreover, beginning with version 3.9, Python has adopted an annual release cycle. This means we will have access to new features and improvements every year.
If you want to learn more about the Python release cycle, read the PEP 602—Annual Release Cycle for Python document, available at https://www.python.org/dev/peps/pep-0602/.
In this chapter, we will take a closer look at the recent Python evolution. We will review a number of important additions across the latest few releases. We will also take a speculative look into the future and present a few features that have been accepted in the PEP process and will become an official part of the Python programming language in the very near future. Along the way, we'll cover the following topics:
But before we review those features, let's begin by considering the technical requirements.
The following Python packages, which are mentioned in this chapter, can be downloaded from PyPI:
mypy
pyright
Information on how to install packages is included in Chapter 2, Modern Python Development Environments.
The code files for this chapter can be found at https://github.com/PacktPublishing/Expert-Python-Programming-Fourth-Edition/tree/main/Chapter%203.
Every release of Python comes with a lot of changes of different types. Almost every release brings some new syntax elements. However, the majority of the changes are related to Python's standard library, the CPython interpreter, the Python API, and CPython's C API. Due to space limitations, it is impossible to cover all of these in this book. That is why we will focus just on new syntax features and new additions to the standard library.
In terms of the two latest versions of Python, we can distinguish four main syntax updates:

- Dictionary merge and update operators (added in Python 3.9)
- Assignment expressions (added in Python 3.8)
- Type-hinting generics (added in Python 3.9)
- Positional-only parameters (added in Python 3.8)

These four features would best be described as quality-of-life improvements. They do not introduce any new programming paradigms, nor do they drastically change the way your code can be written. They simply allow for better coding patterns or enable stricter API definitions.
In recent years, Python core developers have been primarily focused on removing dead or redundant modules from the standard library rather than adding anything new. Still, from time to time, we see some standard library additions. In the last two releases, we have been the beneficiaries of two completely new modules:
- The zoneinfo module for supporting the IANA (Internet Assigned Numbers Authority) time zone database (added in Python 3.9)
- The graphlib module for operating with graph-like structures (added in Python 3.9)

Both modules are fairly small with regards to their API size. Later, we will discuss some example areas where you could apply them. But first, let's zoom into the syntax updates incorporated in Python 3.8 and Python 3.9.
Python allows the use of a number of selected arithmetic operators to manipulate the built-in container types, including lists, tuples, sets, and dictionaries.
For lists and tuples, you can use the +
(addition) operator to concatenate two variables as long as they are the same type. There is also the += operator, which allows for the in-place modification of existing variables. The following transcript presents examples of the concatenation of lists and tuples in an interactive session:
>>> [1, 2, 3] + [4, 5, 6]
[1, 2, 3, 4, 5, 6]
>>> (1, 2, 3) + (4, 5, 6)
(1, 2, 3, 4, 5, 6)
>>> value = [1, 2, 3]
>>> value += [4, 5, 6]
>>> value
[1, 2, 3, 4, 5, 6]
>>> value = (1, 2, 3)
>>> value += (4, 5, 6)
>>> value
(1, 2, 3, 4, 5, 6)
When it comes to sets, there are exactly four binary operators (having two operands) that produce a new set:
- & (bitwise AND): This produces a set with elements common to both sets (their intersection).
- | (bitwise OR): This produces a set with all elements from both sets (their union).
- - (subtraction): This produces a set with elements in the left-hand set that are not in the right-hand set (their difference).
- ^ (bitwise XOR): This produces a set with elements that are in either of the sets but not in both (their symmetric difference).

The following transcript presents examples of these four operations on sets in an interactive session:
>>> {1, 2, 3} & {1, 4}
{1}
>>> {1, 2, 3} | {1, 4}
{1, 2, 3, 4}
>>> {1, 2, 3} - {1, 4}
{2, 3}
>>> {1, 2, 3} ^ {1, 4}
{2, 3, 4}
For a very long time, Python didn't have a dedicated binary operator that would permit the production of a new dictionary from two existing dictionaries. Starting with Python 3.9, we can use the | (bitwise OR) and |= (in-place bitwise OR) operators to perform merge and update operations on dictionaries. These should be the idiomatic way of producing a union of two dictionaries. The reasoning behind adding the new operators was outlined in the PEP 584—Add Union Operators To Dict document.
A programming idiom is the common and most preferable way of performing specific tasks in a given programming language. Writing idiomatic code is an important part of Python culture. The Zen of Python says: "There should be one—and preferably only one—obvious way to do it."
We will discuss more idioms in Chapter 4, Python in Comparison with Other Languages.
In order to merge two dictionaries into a new dictionary, use the following expression:
dictionary_1 | dictionary_2
The resulting dictionary will be a completely new object that will have all the keys of both source dictionaries. If both dictionaries have overlapping keys, the resulting object will receive values from the rightmost object.
Following is an example of using this syntax on two dictionary literals, where the dictionary on the left is updated with values from the dictionary on the right:
>>> {'a': 1} | {'a': 3, 'b': 2}
{'a': 3, 'b': 2}
If you prefer to update the dictionary variable with the keys coming from a different dictionary, you can use the following in-place operator:
existing_dictionary |= other_dictionary
The following is an example of usage with a real variable:
>>> mydict = {'a': 1}
>>> mydict |= {'a': 3, 'b': 2}
>>> mydict
{'a': 3, 'b': 2}
In older versions of Python, the simplest way to update an existing dictionary with the contents of another dictionary was to use the update()
method, as in the following example:
existing_dictionary.update(other_dictionary)
This method modifies existing_dictionary
in place and returns no value. This means that it does not allow the straightforward production of a merged dictionary as an expression and is always used as a statement.
The difference between expressions and statements will be explained in the Assignment expressions section.
It is a little-known fact that Python already supported a fairly concise way to merge two dictionaries before version 3.9 through a feature known as dictionary unpacking. Support for dictionary unpacking in dict
literals was introduced in Python 3.5 with PEP 448 Additional Unpacking Generalizations. The syntax for unpacking two (or more) dictionaries into a new object is as follows:
{**dictionary_1, **dictionary_2}
The example involving real literals is as follows:
>>> a = {'a': 1}; b = {'a':3, 'b': 2}
>>> {**a, **b}
{'a': 3, 'b': 2}
This feature, together with list unpacking (with *value
syntax), may be familiar to those who have experience writing functions that can accept an undefined set of arguments and keyword arguments, also known as variadic functions. This is especially useful when writing decorators.
We will discuss the topic of variadic functions and decorators in detail in Chapter 4, Python in Comparison with Other Languages.
You should remember that dictionary unpacking, while extremely popular in function definitions, is rarely used as a way of merging dictionaries. It may confuse less experienced programmers who are reading your code. That is why you should prefer the new merge operator over dictionary unpacking in code that runs in Python 3.9 and newer versions. For older versions of Python, it is sometimes better to use a temporary dictionary and a simple update()
method.
Yet another way to create an object that is, functionally speaking, a merge of two dictionaries is through the ChainMap
class from the collections
module. This is a wrapper class that takes multiple mapping objects (dictionaries in this instance) and acts as if it was a single mapping object.
The syntax for merging two dictionaries with ChainMap
is as follows:
new_map = ChainMap(dictionary_2, dictionary_1)
Note that the order of dictionaries is reversed compared to the |
operator. This means that if you try to access a specific key of the new_map
object, it will perform lookups over wrapped objects in a left-to-right order. Consider the following transcript, which illustrates examples of operations using the ChainMap
class:
>>> from collections import ChainMap
>>> user_account = {"iban": "GB71BARC20031885581746", "type": "account"}
>>> user_profile = {"display_name": "John Doe", "type": "profile"}
>>> user = ChainMap(user_account, user_profile)
>>> user["iban"]
'GB71BARC20031885581746'
>>> user["display_name"]
'John Doe'
>>> user["type"]
'account'
In the preceding example, we can clearly see that the resulting user
object of the ChainMap
type contains keys from both the user_account
and user_profile
dictionaries. If any of the keys overlap, the ChainMap
instance will return the value of the leftmost mapping that has the specific key. That is the complete opposite of the dictionary merge operator.
ChainMap
is a wrapper object. This means that it doesn't copy the contents of the source mappings provided, but stores them as a reference. This means that if underlying objects change, ChainMap
will be able to return modified data. Consider the following continuation of the previous interactive session:
>>> user["display_name"]
'John Doe'
>>> user_profile["display_name"] = "Abraham Lincoln"
>>> user["display_name"]
'Abraham Lincoln'
Moreover, ChainMap
is writable and propagates changes back to the underlying mappings. What you need to remember is that writes, updates, and deletes only affect the leftmost mapping. If used without proper care, this can lead to some confusing situations, as in the following continuation of the previous session:
>>> user["display_name"] = "John Doe"
>>> user["age"] = 33
>>> user["type"] = "extension"
>>> user_profile
{'display_name': 'Abraham Lincoln', 'type': 'profile'}
>>> user_account
{'iban': 'GB71BARC20031885581746', 'type': 'extension', 'display_name': 'John Doe', 'age': 33}
In the preceding example, we can see that the 'display_name' key was propagated back to the user_account
dictionary, where user_profile
was the initial source dictionary holding such a key. In many contexts, such backpropagating behavior of ChainMap
is undesirable. That's why the common idiom for using it for the purpose of merging two dictionaries actually involves explicit conversion to a new dictionary. The following is an example that uses previously defined input dictionaries:
>>> dict(ChainMap(user_account, user_profile))
{'display_name': 'John Doe', 'type': 'account', 'iban': 'GB71BARC20031885581746'}
If you want to simply merge two dictionaries, you should prefer a new merge operator over ChainMap
. However, this doesn't mean that ChainMap
is completely useless. If the back and forth propagation of changes is your desired behavior, ChainMap
will be the class to use. Also, ChainMap
works with any mapping type. So, if you need to provide unified access to multiple objects that act as if they were dictionaries, ChainMap
will enable the provision of a single merge-like unit to do so.
If you have a custom dict-like class, you can always extend it with the special __or__()
method to provide compatibility with the |
operator instead of using ChainMap
. Overriding special methods will be covered in Chapter 4, Python in Comparison with Other Languages. Anyway, using ChainMap
is usually easier than writing a custom __or__()
method and will allow you to work with pre-existing object instances of classes that you cannot modify.
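If you do decide to go the __or__() route, the following minimal sketch shows the idea (the MyMapping class is hypothetical and used only for illustration):

from collections import UserDict

class MyMapping(UserDict):
    # Hypothetical dict-like class illustrating custom | support.
    def __or__(self, other):
        # Build a new mapping with keys from both operands; on conflicts,
        # values from the right-hand operand win, which mirrors the
        # behavior of dict's | operator.
        merged = MyMapping(self.data)
        merged.update(other)
        return merged

d1 = MyMapping({"a": 1})
print(d1 | {"a": 3, "b": 2})  # {'a': 3, 'b': 2}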
Usually, the most important reason for using ChainMap
over dictionary unpacking or the union operator is backward compatibility. On Python versions older than 3.9, you won't be able to use the new dictionary merge operator syntax. So, if you have to write code for older versions of Python, use ChainMap
. If you don't, it is better to use the merge operator.
Another syntax change that has a big impact on backward compatibility is assignment expressions.
Assignment expressions are a fairly interesting feature because their introduction affected the fundamental part of Python syntax: the distinction between expressions and statements. Expressions and statements are the key building blocks of almost every programming language. The difference between them is really simple: expressions have a value, while statements do not.
Think of statements as consecutive actions or instructions that your program executes. So, value assignments, if
clauses, together with for
and while
loops, are all statements. Function and class definitions are statements, too.
Think of expressions as anything that can be put into an if
clause. Typical examples of expressions are literals, values returned by operators (excluding in-place operators), and comprehensions, such as list, dictionary, and set comprehensions. Function calls and method calls are expressions, too.
There are some elements of many programming languages that are often inseparably bound to statements, such as function and class definitions, loops, and if...else clauses. Python was able to break that barrier by providing syntax features that are expression counterparts of such language elements, namely:

- lambda expressions as anonymous function definitions: lambda x: x**2
- the type() call as a dynamic class definition: type("MyClass", (), {})
- various comprehensions as loop counterparts: squares_of_2 = [x**2 for x in range(10)]
- conditional expressions as counterparts of if … else statements: "odd" if number % 2 else "even"
For many years, however, we haven't had access to syntax that would convey the semantics of assigning a value to a variable in the form of an expression, and this was undoubtedly a conscious design choice on the part of Python's creators. In languages such as C, where variable assignment can be used both as an expression and as a statement, the assignment operator is often confused with the equality comparison operator. Anyone who has programmed in C can attest to the fact that this is a really annoying source of errors. Consider the following example of C code:
int err = 0;
if (err = 1) {
printf("Error occured");
}
And compare it with the following:
int err = 0;
if (err == 1) {
printf("Error occured");
}
Both are syntactically valid in C because err = 1
is an expression in C that will evaluate to the value 1
. Compare this with Python, where the following code will result in a syntax error:
err = 0
if err = 1:
printf("Error occured")
On rare occasions, however, it may be really handy to have a variable assignment operation that would evaluate to a value. Luckily, Python 3.8 introduced the dedicated :=
operator, which assigns a value to the variable but acts as an expression instead of a statement. Due to its visual appearance, it was quickly nicknamed the walrus operator.
The use cases for this operator are, quite frankly, limited. It helps to make code more concise. And often, more concise code is easier to understand because it improves the signal-to-noise ratio. The most common scenario for the walrus operator is when a complex value needs to be evaluated and then immediately used in the statements that follow.
A commonly referenced example is working with regular expressions. Let's imagine a simple application that reads source code written in Python and scans it with regular expressions looking for imported modules.
Without the use of assignment expressions, the code could appear as follows:
import os
import re
import sys
import_re = re.compile(
    r"^\s*import\s+\.{0,2}((\w+\.)*(\w+))\s*$"
)
import_from_re = re.compile(
    r"^\s*from\s+\.{0,2}((\w+\.)*(\w+))\s+import\s+(\w+|\*)+\s*$"
)

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print(f"usage: {os.path.basename(__file__)} file-name")
        sys.exit(1)

    with open(sys.argv[1]) as file:
        for line in file:
            match = import_re.search(line)
            if match:
                print(match.groups()[0])
            match = import_from_re.search(line)
            if match:
                print(match.groups()[0])
As you can observe, we had to repeat the same pattern twice: evaluate a match of a complex regular expression and then retrieve the grouped tokens from it. That block of code could be rewritten with assignment expressions in the following way:
if match := import_re.search(line):
    print(match.groups()[0])
if match := import_from_re.search(line):
    print(match.groups()[0])
As you can see, there is a small improvement in terms of readability, but it isn't dramatic. This type of change really shines in situations where you need to repeat the same pattern multiple times. The continuous assignment of temporary results to the same variable can make code look unnecessarily bloated.
Another use case could be reusing the same data in multiple places in larger expressions. Consider the example of a dictionary literal that represents some predefined data of an imaginary user:
first_name = "John"
last_name = "Doe"
height = 168
weight = 70
user = {
    "first_name": first_name,
    "last_name": last_name,
    "display_name": f"{first_name} {last_name}",
    "height": height,
    "weight": weight,
    "bmi": weight / (height / 100) ** 2,
}
Let's assume that in our situation, it is important to keep all the elements consistent. Hence, the display name should always consist of a first name and a last name, and the BMI should be calculated on the basis of weight and height. In order to prevent us from making a mistake when editing specific data components, we had to define them as separate variables. These are no longer required once a dictionary has been created. Assignment expressions enable the preceding dictionary to be written in a more concise way:
user = {
    "first_name": (first_name := "John"),
    "last_name": (last_name := "Doe"),
    "display_name": f"{first_name} {last_name}",
    "height": (height := 168),
    "weight": (weight := 70),
    "bmi": weight / (height / 100) ** 2,
}
As you can see, we had to wrap assignment expressions with parentheses. Unfortunately, the :=
syntax clashes with the :
character used as an association operator in dictionary literals and parentheses are a way around that.
Assignment expressions are a tool for polishing your code and nothing more. Always make sure that once applied, they actually improve readability, instead of making it more obscure.
Type-hinting annotations, although completely optional, are an increasingly popular feature of Python. They allow you to annotate variable, argument, and function return types with type definitions. These type annotations serve documentation purposes, but can also be used to validate your code using external tools. Many programming IDEs are able to understand typing annotations and visually highlight potential typing problems. There are also static type checkers, such as mypy or pyright, that can be used to scan through the whole code base and report all typing errors of code units that use annotations.
The story of the mypy project is very interesting. It began life as the Ph.D. research of Jukka Lehtosalo, but it really started to take shape when he started working on it together with Guido van Rossum (Python creator) at Dropbox. You can learn more about that story from the farewell letter to Guido on Dropbox's tech blog at https://blog.dropbox.com/topics/company/thank-you--guido.
In its simplest form, type hinting can be used in conjunction with built-in or custom types to specify the desired types of function input arguments and return values, as well as local variables. Consider the following function, which allows the performance of a case-insensitive lookup of keys in a string-keyed dictionary:
from typing import Any
def get_ci(d: dict, key: str) -> Any:
    for k, v in d.items():
        if key.lower() == k.lower():
            return v
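For instance, assuming an arbitrary example dictionary, the lookup ignores the casing of keys:

>>> get_ci({"Content-Type": "text/json", "Accept": "*/*"}, "CONTENT-TYPE")
'text/json'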
The preceding example is, of course, a naïve implementation of a case-insensitive lookup. If you would like to do this in a more performant way, you would probably require a dedicated class. We will eventually revisit this problem later in the book.
The first statement imports from the typing
module the Any
type, which defines that the variable or argument can be of any type. The signature of our function specifies that the first argument, d
, should be a dictionary, while the second argument, key
, should be a string. The signature ends with the specification of a return value, which can be of any type.
If you're using type checking tools, the preceding annotations will be sufficient to detect many mistakes. If, for instance, a caller switches the order of positional arguments, you will be able to detect the error quickly, as the key
and d
arguments are annotated with different types. However, these tools will not complain in a situation where a user passes a dictionary that uses different types for keys.
For that very reason, generic types such as tuple
, list
, dict
, set
, frozenset
, and many more can be further annotated with types of their content. For a dictionary, the annotation has the following form:
dict[KeyType, ValueType]
The signature of the get_ci()
function, with more restrictive type annotations, would be as follows:
def get_ci(d: dict[str, Any], key: str) -> Any: ...
In older versions of Python, built-in collection types could not be annotated so easily with the types of their content. The typing module provides special types that can be used for that purpose. These types include:

- typing.Dict for dictionaries
- typing.List for lists
- typing.Tuple for tuples
- typing.Set for sets
- typing.FrozenSet for frozen sets

These types are still useful if you need to provide functionality for a wide spectrum of Python versions, but if you're writing code for Python 3.9 and newer releases only, you should use the built-in generics instead. Importing those types from the typing module is deprecated and they will be removed from Python in the future.
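For reference, a sketch of how the same signature could be written for Python versions older than 3.9, using the deprecated typing aliases, could look like this:

from typing import Any, Dict

def get_ci(d: Dict[str, Any], key: str) -> Any: ...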
We will take a closer look at typing annotations in Chapter 4, Python in Comparison with Other Languages.
Python is quite flexible when it comes to passing arguments to functions. There are two ways in which function arguments can be provided to functions: as positional arguments or as keyword arguments.
For many functions, it is the choice of the caller in terms of how arguments are passed. This is a good thing because the user of the function can decide that a specific usage is more readable or convenient in a given situation. Consider the following example of a function that concatenates the strings using a delimiter:
def concatenate(first: str, second: str, delim: str):
    return delim.join([first, second])
There are multiple ways in terms of how this function can be called:
concatenate("John", "Doe", " ")
concatenate(first="John", second="Doe", delim=" ")
concatenate("John", "Doe", delim=" ")
If you are writing a reusable library, you may already know how your library is intended to be used. Sometimes, you may know from your experience that specific usage patterns will make the resulting code more readable, or quite the opposite. You may not be certain about your design yet and want to make sure that the API of your library may be changed within a reasonable time frame without affecting your users. In either case, it is a good practice to create function signatures in a way that supports the intended usage and also allows for future extension.
Once you publish your library, the function signature forms a usage contract with your library. Any change to the argument names and their ordering can break the applications of programmers using that library.
If you were to realize at some point in time that the argument names first
and second
don't properly explain their purpose, you cannot change them without breaking backward compatibility. That's because there may be a programmer who used the following call:
concatenate(first="John", second="Doe", delim=" ")
If you want to convert the function into a form that accepts any number of strings, you can't do that without breaking backward compatibility because there might be a programmer who used the following call:
concatenate("John", "Doe", " ")
Fortunately, Python 3.8 added the option to define specific arguments as positional-only. This way, you may denote which arguments cannot be passed as keyword arguments in order to avoid issues with backward compatibility in the future. You can also denote specific arguments as keyword-only. Careful consideration as to which arguments should be passed as positional-only and which as keyword-only makes function definitions more amenable to future changes. Our concatenate()
function, defined with the use of positional-only and keyword-only arguments, could look as follows:
def concatenate(first: str, second: str, /, *, delim: str):
    return delim.join([first, second])
The way in which you read this definition is as follows:

- Arguments preceding the / mark are positional-only arguments
- Arguments following the * mark are keyword-only arguments

The preceding definition ensures that the only valid call to the concatenate()
function would be in the following form:
concatenate("John", "Doe", delim=" ")
And if you were to try to call it differently, you would receive a TypeError
error, as in the following example:
>>> concatenate("John", "Doe", " ")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: concatenate() takes 2 positional arguments but 3 were given
Let's assume that we've published our function in a library in the last format and now we want to make it accept an unlimited number of positional arguments. As there is only one way in which this function can be used, we can now use argument unpacking to implement the following change:
def concatenate(*items, delim: str):
    return delim.join(items)
The *items
argument will capture all the positional arguments in the items
tuple. Thanks to such changes, users will be able to use the function with a variable number of positional items, as in the following examples:
>>> concatenate("John", "Doe", delim=" ")
'John Doe'
>>> concatenate("Ronald", "Reuel", "Tolkien", delim=" ")
'Ronald Reuel Tolkien'
>>> concatenate("Jay", delim=" ")
'Jay'
>>> concatenate(delim=" ")
''
Positional-only and keyword-only arguments are a great tool for library creators as they create some space for future design changes that won't affect their users. But they are also a great tool for writing applications, especially if you work with other programmers. You can utilize positional-only and keyword-only arguments to make sure that functions will be invoked as intended. This may help in future code refactoring.
Handling time and time zones is one of the most challenging aspects of programming. One of the main reasons is the numerous common misconceptions that programmers have about time and time zones. Another reason is the never-ending stream of updates to actual time zone definitions. And these changes happen every year, often for political reasons.
Python, starting from version 3.9, makes access to the information regarding current and historical time zones easier than ever. The Python standard library provides a zoneinfo
module that is an interface to the time zone database either provided by your operating system or obtained as a first-party tzdata
package from PyPI.
Packages from PyPI are considered third-party packages, while standard library modules are considered first-party packages. tzdata
is quite unique because it is maintained by CPython's core developers. The reason for extracting the contents of the IANA database to separate packages on PyPI is to ensure regular updates that are independent from CPython's release cadence.
Actual usage involves creating ZoneInfo
objects using the following constructor call:
ZoneInfo(timezone_key)
Here, timezone_key
is a filename from IANA's time zone database. These filenames resemble the way in which time zones are often presented in various applications. Examples include:
Europe/Warsaw
Asia/Tel_Aviv
America/Fort_Nelson
GMT-0
Instances of the ZoneInfo
class can be used as a tzinfo
parameter of the datetime
object constructor, as in the following example:
from datetime import datetime
from zoneinfo import ZoneInfo
dt = datetime(2020, 11, 28, tzinfo=ZoneInfo("Europe/Warsaw"))
This allows you to create so-called time zone-aware datetime objects. Time zone-aware datetime objects are essential in properly calculating the time differences in specific time zones because they are able to take into account things such as changes between standard and daylight-saving time, together with any historical changes made to IANA's time zone database.
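As a quick illustration (the date and zones are arbitrary), time zone-aware datetime objects can be converted between zones with the standard astimezone() method:

from datetime import datetime
from zoneinfo import ZoneInfo

utc_dt = datetime(2020, 11, 28, 12, 0, tzinfo=ZoneInfo("UTC"))
# Warsaw uses CET (UTC+1) in late November, so this prints
# 2020-11-28 13:00:00+01:00
print(utc_dt.astimezone(ZoneInfo("Europe/Warsaw")))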
You can obtain a full list of all the time zones available in your system using the zoneinfo.available_timezones()
function.
Another interesting addition to the Python standard library is the graphlib
module, added in Python 3.9. This is a module that provides utilities for working with graph-like data structures.
A graph is a data structure consisting of nodes connected by edges. Graphs are a concept from the field of mathematics known as graph theory. Depending on the edge type, we can distinguish between two main types of graphs: undirected graphs, whose edges have no direction, and directed graphs, whose edges point from one node to another.
Moreover, graphs can be either cyclic or acyclic. A cyclic graph is a graph that has at least one cycle—a closed path that starts and ends at the same node. An acyclic graph is a graph that does not have any cycles. Figure 3.1 presents example representations of directed and undirected graphs:
Figure 3.1: Visual representations of various graph types
Graph theory deals with many mathematical problems that can be modeled using graph structures. In programming, graphs are used to solve many algorithmic problems. In computer science, graphs can be used to represent the flow of data or relationships between objects. This has many practical applications, including:
The graphlib
module is supposed to aid Python programmers when working with graphs. This is a new module, so it currently only includes a single utility class named TopologicalSorter
. As the name suggests, this class is able to perform a topological sort of directed acyclic graphs.
Topological sorting is the operation of ordering nodes of a Directed Acyclic Graph (DAG) in a specific way. The result of topological sorting is a list of all nodes in which every node appears before all the nodes that can be reached from it; in other words, for every edge leading from node u to node v, node u appears before node v in the resulting order.
Some graphs may have multiple orderings that satisfy the requirements of topological sorting. Figure 3.2 presents an example DAG with three possible topological orderings:
Figure 3.2: Various ways to sort the same graph topologically
To better understand the use of topological sorting, let's consider the following problem. We have a complex operation to execute that consists of multiple dependent tasks. This job could be, for instance, migrating multiple database tables between two different database systems. This is a well-known problem, and there are already multiple tools that can migrate data between various database management systems. But for the sake of illustration, let's assume that we don't have such a system and need to build something from scratch.
In relational database systems, rows in tables are often cross-referenced, and the integrity of those references is guarded by foreign key constraints. If we would like to ensure that, at any given point in time, the target database is referentially integral, we would have to migrate all the tables in a specific order. Let's assume we have the following database tables:

- A customers table, which holds personal information pertaining to customers.
- An accounts table, which holds information about user accounts, including their balances. A single user can have multiple accounts (for instance, personal and business accounts), and the same account cannot be accessed by multiple users.
- A products table, which holds information on the products available for sale in our system.
- An orders table, which holds individual orders of multiple products within a single account made by a single user.
- An order_products table, which holds information regarding the quantities of individual products within a single order.

Python does not have any special data type dedicated to representing graphs. But it has a dictionary type that is great at mapping relationships between keys and values. Let's define references between our imaginary tables:
table_references = {
    "customers": set(),
    "accounts": {"customers"},
    "products": set(),
    "orders": {"accounts", "customers"},
    "order_products": {"orders", "products"},
}
If our reference graph does not have cycles, we can topologically sort it. The result of that sorting would be a possible table migration order. The constructor of the graphlib.TopologicalSorter
class accepts as input a single dictionary in which keys are origin nodes and values are sets of destination nodes. This means that we can pass our table_references
variable directly to the TopologicalSorter()
constructor. To perform a topological sort, we can use the static_order()
call, as in the following transcript from an interactive session:
>>> from graphlib import TopologicalSorter
>>> table_references = {
... "customers": set(),
... "accounts": {"customers"},
... "products": set(),
... "orders": {"accounts", "customers"},
... "order_products": {"orders", "products"},
... }
>>> sorter = TopologicalSorter(table_references)
>>> list(sorter.static_order())
['customers', 'products', 'accounts', 'orders', 'order_products']
Topological sorting can be performed only on DAGs. TopologicalSorter
doesn't check for the existence of cycles during initialization, although it will detect cycles during sorting. If a cycle is found, the static_order()
method will raise a graphlib.CycleError
exception.
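The following minimal sketch (using a hypothetical two-node cycle) shows how that failure mode could be handled:

from graphlib import CycleError, TopologicalSorter

cyclic = {"a": {"b"}, "b": {"a"}}  # "a" depends on "b" and "b" depends on "a"
try:
    order = list(TopologicalSorter(cyclic).static_order())
except CycleError:
    print("cannot sort: the graph contains a cycle")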
Our example was, of course, straightforward and fairly easy to solve by hand. However, real databases often consist of dozens or even hundreds of tables. Preparing such a plan manually for databases that big would be a very tedious and error-prone task.
The features we've reviewed so far are quite new, so it will take some time until they become the mainstream elements of Python. That's because they are not backward compatible, and older versions of Python are still supported by many library maintainers.
In the next section, we will review a number of important Python elements introduced in Python 3.6 and Python 3.7, so we will definitely have wider Python version coverage. Not all of these new elements are popular though, so I hope you will still learn something.
Every Python release brings something new. Some changes are real revelations; they greatly improve the way we can program and are adopted almost instantly by the community. The benefits of other changes, however, may not be obvious at the beginning and they may require a little more time to really take off.
We've seen this happening with function annotations that were part of Python from the very first 3.0 release. It took years to build an ecosystem of tools that would leverage them. Now, annotations seem almost ubiquitous in modern Python applications.
The core Python developers are very conservative about adding new modules to the standard library and we rarely see new additions. Still, chances are that you will soon forget about using the graphlib
or zoneinfo
modules if you don't have the opportunity to work with problems that require manipulating graph-like data structures or the careful handling of time zones. You may have already forgotten about other nice additions to Python that have happened over the past few years. That's why we will do a brief review of a few important changes that happened in versions older than Python 3.7. These will either be small but interesting additions that could easily be missed, or things that simply take time to get used to.
We discussed the topic of debuggers in Chapter 2, Modern Python Development Environments. The breakpoint()
function was already mentioned there as an idiomatic way of invoking the Python debugger.
It was added in Python 3.7, so has already been available for quite some time. Still, it is one of those changes that simply takes some effort to get used to. We've been told and taught for many years that the simplest way to invoke the debugger from Python code is via the following snippet:
import pdb; pdb.set_trace()
It doesn't look pretty, nor does it look straightforward but, if you've been doing that every day for years, as many programmers have, you would have that in your muscle memory. Problem? Jump to the code, input a few keystrokes to invoke pdb
, and then restart the program. Now you're in the interpreter shell at the very same spot as your error occurs. Done? Go back to the code, remove import pdb; pdb.set_trace()
, and then start working on your fix.
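With the breakpoint() function, that first step boils down to dropping a single call wherever you want to pause (the function below is hypothetical and used only for illustration):

def apply_discount(price, discount):
    breakpoint()  # starts a pdb session here by default
    return price * (1 - discount)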
So why should you bother? Isn't that something of a personal preference? Are breakpoints something that ever get to production code?
The truth is that debugging is often a solitary and deeply personal task. We often spend numerous hours struggling with bugs, looking for clues, and reading code over and over in a desperate attempt to locate that small mistake that is breaking our application. When you're deeply focused on finding the cause of a problem, you should definitely use something that you find the most convenient. Some programmers prefer debuggers integrated into IDEs. Some programmers don't even use debuggers, preferring elaborated print()
calls spread all over the code instead. Always choose whatever you find the most convenient.
But if you're used to a plain old shell-based debugger, the breakpoint()
function can make your work easier. The main advantage of this function is that it isn't bound to a single debugger. By default, it invokes a pdb
session, but this behavior can be modified with a PYTHONBREAKPOINT
environment variable. If you prefer to use an alternative debugger (such as ipdb
, as mentioned in Chapter 2, Modern Python Development Environments), you can set this environment variable to a value that will tell Python which function to invoke.
Standard practice is to set your preferred debugger in a shell profile script so that you don't have to modify this variable in every shell session. For instance, if you're a Bash user and want to always use ipdb
instead of pdb
, you could insert the following statement in your .bash_profile
file:
export PYTHONBREAKPOINT=ipdb.set_trace
This approach also works well when working together. For instance, if someone asks for your help with debugging, you can ask them to insert breakpoint statements in suspicious places. That way, when you run the code on your own computer, you will be using the debugger of your choice.
If you don't know where to put your breakpoint, but the application exits upon an unhandled exception, you can use the postmortem feature of pdb
. With the following command, you can start your Python script in a debugging session that will pause at the moment the exception was raised:
python3 -m pdb -c continue script.py
Since version 3.7, the Python interpreter can be invoked in dedicated development mode, which introduces additional runtime checks. These are helpful in diagnosing potential issues that may arise when running the code. In correctly working code, those checks would be unnecessarily expensive, so they are disabled by default.
Development mode can be enabled in two ways:

- Using the -X dev command-line option of the Python interpreter, for instance:

  python -X dev my_application.py

- Setting the PYTHONDEVMODE environment variable, for instance:

  PYTHONDEVMODE=1 python my_application.py
The most important effects that this mode enables include the following:

- The emission of additional warnings (such as ResourceWarning) that are not displayed during the normal operation of your code
- The installation of additional fault handlers that print the Python stack trace when the application is terminated by a SIGSEGV, SIGFPE, SIGABRT, SIGBUS, or SIGILL system signal

Warnings emitted in development mode are indications that something does not work the way it should. They may be useful in finding problems that are not necessarily manifested as errors during the normal operation of your code, but may lead to tangible defects in the long term.
The improper cleanup of opened files may lead at some point to resource exhaustion of the environment your application is running in. File descriptors are resources, the same as RAM or disk storage. Every operating system has a limited number of files that can be opened at the same time. If your application is opening new files without closing them, at some point, it won't be able to open new ones.
Development mode enables you to identify such problems in advance. This is why it is advised to use this mode during application testing. Due to the additional overhead of checks enabled by development mode, it is not recommended to use this in production environments.
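For instance, consider the following hypothetical script, which never closes the file object it opens:

# leaky.py -- hypothetical example for illustration only
def read_first_line(path):
    return open(path).readline()  # the file object is never closed

print(read_first_line("leaky.py"))

Run normally, it works silently; run as python -X dev leaky.py, the interpreter emits a ResourceWarning about the unclosed file when the forgotten file object is garbage collected.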
Sometimes, development mode can be used to diagnose existing problems, too. An example of really problematic situations is when your application experiences a segmentation fault.
When this happens in Python, you usually won't get any details of the error, except the very brief message printed on your shell's standard output:
Segmentation fault: 11
When a segmentation fault occurs, the Python process receives a SIGSEGV
system signal and terminates instantly. On some operating systems, you will receive a core dump, which is a snapshot of the process memory state recorded at the time of the crash. This can be used to debug your application. Unfortunately, in the case of CPython, this will be a memory snapshot of the interpreter process, so debugging will be taking place at the level of C code.
Development mode installs additional fault handler code that will output the Python stack trace whenever it receives a fault signal. Thanks to this, you will have a bit more information about which part of the code could lead to the problem. The following is an example of known code that will lead to a segmentation fault in Python 3.9:
import sys
sys.setrecursionlimit(1 << 30)
def crasher():
    return crasher()

crasher()
If you execute this in the Python interpreter with the -X dev
flag, you will get output similar to the following:
Fatal Python error: Segmentation fault
Current thread 0x000000010b04edc0 (most recent call first):
File "/Users/user/dev/crashers/crasher.py", line 6 in crasher
File "/Users/user/dev/crashers/crasher.py", line 6 in crasher
File "/Users/user/dev/crashers/crasher.py", line 6 in crasher
File "/Users/user/dev/crashers/crasher.py", line 6 in crasher
File "/Users/user/dev/crashers/crasher.py", line 6 in crasher
...
This fault handler can also be enabled outside of development mode. To do that, you can use the -X faulthandler
command-line option or set the PYTHONFAULTHANDLER
environment variable to 1
.
It's not easy to cause segmentation faults in Python. This often happens for some Python extensions written in C or C++ or functions called from shared libraries (such as DLLs, .dylib
, or .so
objects). Still, there are some known and well documented conditions where this problem can occur in pure Python code. The repository of the CPython interpreter includes a collection of such known "crashers." This can be found under https://github.com/python/cpython/tree/master/Lib/test/crashers.
Every Python class can define the custom __getattr__()
and __dir__()
methods to customize the dynamic attribute access of objects. The __getattr__()
method is invoked when a given attribute name cannot be found through normal lookup; it captures the missing attribute access and can possibly generate a value on the fly. The __dir__()
method is called when an object is passed to the dir()
function and it should return a list of object attribute names.
Starting from Python 3.7, the __getattr__()
and __dir__()
functions can be defined at module level. The semantics are similar to object methods. The __getattr__()
module-level function, if defined, will be called on a failed module member lookup. The __dir__()
function will be called when a module object is passed to the dir()
function.
This feature may be useful for library maintainers when deprecating module functions or classes. Let's imagine that we exposed our get_ci()
function from the Type-hinting generics section in an open source library called dict_helpers.py
. If we would like to rename the function to ci_lookup()
and still be allowed to import it under the old name, we could use the following deprecation pattern:
from typing import Any
from warnings import warn
def ci_lookup(d: dict[str, Any], key: str) -> Any:
    ...

def __getattr__(name: str):
    if name == "get_ci":
        warn(f"{name} is deprecated", DeprecationWarning)
        return ci_lookup
    raise AttributeError(f"module {__name__} has no attribute {name}")
The preceding pattern will emit DeprecationWarning
, regardless of whether the get_ci()
function is imported directly from a module (such as via from dict_helpers import get_ci
) or accessed as a dict_helpers.get_ci
attribute.
Deprecation warnings are not visible by default. You can enable them in development mode.
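If you don't want to run the whole interpreter in development mode, deprecation warnings can also be enabled selectively through Python's standard warning filters, for instance:

python -W always::DeprecationWarning my_application.py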
F-strings, also known as formatted string literals, are one of the most beloved Python features that came with Python 3.6. Introduced with PEP 498, they added a new way of formatting strings. Prior to Python 3.6, we already had two different string formatting methods. So right now, there are three different ways in which a single string can be formatted:
- Using % formatting: This is the oldest method, with a substitution syntax reminiscent of the printf() function from the C standard library:

  >>> import math
  >>> "approximate value of π: %f" % math.pi
  'approximate value of π: 3.141593'

- Using the str.format() method: This method is more convenient and less error-prone than % formatting, although it is more verbose. It enables the use of named substitution tokens as well as reusing the same value many times:

  >>> import math
  >>> "approximate value of π: {pi:f}".format(pi=math.pi)
  'approximate value of π: 3.141593'

- Using formatted string literals, also known as f-strings:

  >>> import math
  >>> f"approximate value of π: {math.pi:f}"
  'approximate value of π: 3.141593'
Formatted string literals are denoted with the f
prefix, and their syntax is closest to the str.format()
method, as they use a similar markup for denoting replacement fields in formatted text. In the str.format()
method, the text substitutions refer to positional and keyword arguments. What makes f-strings special is that replacement fields can be any Python expression, and it will be evaluated at runtime. Inside strings, you have access to any variable that is available in the same namespace as the formatted literal.
The ability to use expressions as replacement fields makes formatting code simpler and shorter. You can also use the same formatting specifiers of replacement fields (for padding, aligning, signs, and so on) as the str.format()
method, and the syntax is as follows:
f"{replacement_field_expression:format_specifier}"
The following is a simple example of code executed in an interactive session that prints the first ten powers of the number 10 using f-strings and aligns the results using string formatting with padding:
>>> for x in range(10):
... print(f"10^{x} == {10**x:10d}")
...
10^0 == 1
10^1 == 10
10^2 == 100
10^3 == 1000
10^4 == 10000
10^5 == 100000
10^6 == 1000000
10^7 == 10000000
10^8 == 100000000
10^9 == 1000000000
The full formatting specification of the Python string forms a separate mini language inside Python. The best reference source for this is the official documentation, which you can find under https://docs.python.org/3/library/string.html. Another useful internet resource regarding this topic is https://pyformat.info/, which presents the most important elements of this specification using practical examples.
Underscores in numeric literals are probably the easiest such feature to adopt, but they are still not as popular as they could be. Starting from Python 3.6, you can use the _
(underscore) character to separate digits in numeric literals. This facilitates the increased readability of big numbers. Consider the following value assignment:
account_balance = 100000000
With so many zeros, it is hard to tell immediately whether we are dealing with millions or billions. You can instead use an underscore to separate thousands, millions, billions, and so on:
account_balance = 100_000_000
Now, it is easier to tell immediately that account_balance
equals one hundred million without carefully counting the zeros.
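Underscores also work in other numeric literal forms (binary, hexadecimal, and float literals), and the same character can be used as a grouping option in format specifiers, as in this short illustration:

>>> account_balance = 100_000_000
>>> f"{account_balance:_}"
'100_000_000'
>>> 0b0010_1010
42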
One of the prevalent security mistakes made by many programmers is expecting cryptographic-quality randomness from the random
module. The nature of random numbers generated by the random
module is sufficient for statistical purposes. It uses the Mersenne Twister pseudorandom number generator. It has a known uniform distribution and a long enough period length that it can be used in simulations, modeling, or numerical integration.
However, Mersenne Twister is a completely deterministic algorithm, as is the random
module. This means that as a result of knowing its initial conditions (the seed number), you can generate the same pseudorandom numbers. Moreover, by knowing enough consecutive results of a pseudorandom generator, it is usually possible to retrieve the seed number and predict the next results. This is true for Mersenne Twister as well.
If you want to see how random numbers from Mersenne Twister can be predicted, you can review the following project on GitHub: https://github.com/kmyk/mersenne-twister-predictor.
That characteristic of pseudorandom number generators means that they should never be used for generating random values in a security context. For instance, if you need to generate a random secret that would be a user password or token, you should use a different source of randomness.
The secrets
module serves exactly that purpose. It relies on the best source of randomness that a given operating system provides. So, on Unix and Unix-like systems, that would be the /dev/urandom
device, and on Windows, it will be the CryptGenRandom
generator.
The three most important functions are:

- secrets.token_bytes(nbytes=None): This returns nbytes of random bytes. This function is used internally by secrets.token_hex() and secrets.token_urlsafe(). If nbytes is not specified, it will return a default number of bytes, which is documented as "reasonable."
- secrets.token_hex(nbytes=None): This returns nbytes of random bytes in the form of a hex-encoded string (not a bytes() object). As it takes two hexadecimal digits to encode one byte, the resulting string will consist of nbytes × 2 characters. If nbytes is not specified, it will return the same default number of bytes as secrets.token_bytes().
- secrets.token_urlsafe(nbytes=None): This returns nbytes of random bytes in the form of a URL-safe, base64-encoded string. As a single byte takes approximately 1.3 characters in base64 encoding, the resulting string will consist of nbytes × 1.3 characters. If nbytes is not specified, it will return the same default number of bytes as secrets.token_bytes().

Another important, but often overlooked, function is secrets.compare_digest(a, b). This compares two strings or byte-like objects in a way that does not allow an attacker to guess if they at least partially match by measuring how long it took to compare them. A comparison of two secrets using ordinary string comparison (the == operator) is susceptible to a so-called timing attack. In such a scenario, the attacker can try to execute multiple secret verifications and, by performing statistical analysis, gradually guess consecutive characters of the original value.
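A short sketch of typical usage (the token length of 32 bytes is an arbitrary choice):

import secrets

token = secrets.token_urlsafe(32)  # URL-safe text built from 32 random bytes
stored_secret = token

# Later, compare a user-provided value against the stored secret in
# constant time to avoid leaking information through timing differences.
provided = input("Enter token: ")
print(secrets.compare_digest(provided, stored_secret))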
At the time of writing this book, Python 3.9 is still only a few months old, but the chances are that when you're reading this book, Python 3.10 has either already been released or is right around the corner.
As the Python development processes are open and transparent, we have constant insight into what has been accepted in the PEP documents and what has already been implemented in alpha and beta releases. This allows us to review selected features that will be introduced in Python 3.10. The following is a brief review of the most important changes that we can expect in the near future.
Python 3.10 will bring yet another syntax simplification for the purpose of type hinting. Thanks to this new syntax, it will be easier to construct union-type annotations.
Python is dynamically typed and does not have function overloading. As a result of this, functions can easily accept the same argument, which can be of a different type depending on the call, and properly process it if those types have the same interface. To better understand this, let's bring back the signature of a function that allowed case-insensitive lookup of string-keyed dictionary values:
def get_ci(d: dict[str, Any], key: str) -> Any: ...
Internally, we used the lower() method of the key argument and of the keys obtained from the dictionary. That's the main reason why we defined the type of the d
argument as dict[str, Any]
, and the type of key
argument as str
.
However, the str
type is not the only built-in type that has the upper()
method. The other type that has the same method is bytes
. If we would like to allow our get_ci()
function to accept both string-keyed and bytes-keyed dictionaries, we need to specify the union of possible types.
Currently, the only way to specify type unions is through the typing.Union
hint. This hint allows the union of bytes
and str
types to be specified as typing.Union[bytes, str]
. The complete signature of the get_ci()
function would be as follows:
def get_ci(
    d: dict[Union[str, bytes], Any],
    key: Union[str, bytes]
) -> Any:
    ...
That is already verbose, and for more complex functions, it can only get worse. This is why Python 3.10 will allow unions of types to be specified using the | operator. In the future, you will be able to simply write the following:
def get_ci(d: dict[str | bytes, Any], key: str | bytes) -> Any: ...
In contrast to type-hinting generics, the introduction of a type union operator does not deprecate the typing.Union
hint. This means that we will be able to use those two conventions interchangeably.
Structural pattern matching is definitely the most controversial new Python feature of the last decade, and it is definitely the most complex one.
The acceptance of that feature was preceded by numerous heated debates and countless design drafts. The complexity of the topic is clearly visible if we take a look over all the PEP documents that tried to tackle the problem. The following is a table of all PEP documents related to structural pattern matching (statuses accurate as of March 2021):
Date | PEP | Title | Type | Status
23-Jun-2020 | 622 | Structural Pattern Matching | Standards Track | Superseded by PEP 634
12-Sep-2020 | 634 | Structural Pattern Matching: Specification | Standards Track | Accepted
12-Sep-2020 | 635 | Structural Pattern Matching: Motivation and Rationale | Informational | Final
12-Sep-2020 | 636 | Structural Pattern Matching: Tutorial | Informational | Final
26-Sep-2020 | 642 | Explicit Pattern Syntax for Structural Pattern Matching | Standards Track | Draft
9-Feb-2021 | 653 | Precise Semantics for Pattern Matching | Standards Track | Draft
That's a lot of documents, and none of them are short. So, what is structural pattern matching and how can it be useful?
Structural pattern matching introduces a match statement and two new soft keywords: match
and case
. As the name suggests, it can be used to match a given value against a list of specified "cases" and act accordingly to the match.
A soft keyword is a keyword that is not reserved in every context. Both match
and case
can be used as ordinary variables or function names outside the match statement context.
For some programmers, the syntax of the match statement resembles the syntax of the switch statement found in languages such as C, C++, Pascal, Java, and Go. It can indeed be used to implement the same programming pattern, but is definitely much more powerful.
The general (and simplified) syntax for a match statement is as follows:
match expression:
    case pattern:
        ...
expression
can be any valid Python expression. pattern
represents an actual matching pattern that is a new concept in Python. Inside a case
block, you can have multiple statements. The complexity of a match statement stems mostly from the introduction of match patterns that may initially be hard to understand. Patterns can also be easily confused with expressions, but they don't evaluate like ordinary expressions do.
But before we dig into the details of match patterns, let's take a look at a simple example of a match statement that replicates the functionality of switch statements from different programming languages:
import sys
match sys.platform:
    case "win32":
        print("Running on Windows")
    case "darwin":
        print("Running on macOS")
    case "linux":
        print("Running on Linux")
    case _:
        raise NotImplementedError(
            f"{sys.platform} not supported!"
        )
This is, of course, a very straightforward example, but already shows some important elements. First, we can use literals as patterns. Second, there is a special _
(underscore) wildcard pattern. The wildcard pattern, and other patterns that from the syntax alone can be proven to always match, create an irrefutable case block. An irrefutable case block can be placed only as the last block of a match statement.
The previous example can, of course, be implemented with a simple chain of if
, elif
, and else
statements. A common entry-level recruitment challenge is writing a FizzBuzz program.
A FizzBuzz program iterates from 0 to an arbitrary number and, depending on the value, does three things:
- Prints Fizz if the value is divisible by 3
- Prints Buzz if the value is divisible by 5
- Prints FizzBuzz if the value is divisible by 3 and 5

This is indeed a minor problem, but you would be surprised how people can stumble on even the simplest things when under the stress of an interview. This can, of course, be solved with a couple of if statements, but the use of a match statement can give our solution some natural elegance:
for i in range(100):
    match (i % 3, i % 5):
        case (0, 0): print("FizzBuzz")
        case (0, _): print("Fizz")
        case (_, 0): print("Buzz")
        case _: print(i)
In the preceding example, we are matching (i % 3, i % 5)
in every iteration of the loop. We have to do both modulo divisions because the result of every iteration depends on both division results. A match expression will stop evaluating patterns once it finds a matching block and will execute only one block of code.
The notable difference from the previous example is that we used mostly sequence patterns instead of literal patterns:
- The (0, 0) pattern: This will match a two-element sequence if both elements are equal to 0.
- The (0, _) pattern: This will match a two-element sequence if the first element is equal to 0. The other element can be of any value and type.
- The (_, 0) pattern: This will match a two-element sequence if the second element is equal to 0. The other element can be of any value and type.
- The _ pattern: This is a wildcard pattern that will match all values.

Match expressions aren't limited to simple literals and sequences of literals. You can also match against specific classes and actually, with class patterns, things start to get really magical. That's definitely the most complex part of the whole feature.
At the time of writing, Python 3.10 hasn't yet been released, so it's hard to show a typical and practical use case for class matching patterns. So instead, we will take a look at an example from an official tutorial. The following is a modified example from the PEP 636 document that includes a simple where_is()
function, which can match against the structure of the Point
class instance provided:
class Point:
    x: int
    y: int

    def __init__(self, x, y):
        self.x = x
        self.y = y

def where_is(point):
    match point:
        case Point(x=0, y=0):
            print("Origin")
        case Point(x=0, y=y):
            print(f"Y={y}")
        case Point(x=x, y=0):
            print(f"X={x}")
        case Point():
            print("Somewhere else")
        case _:
            print("Not a point")
A lot is happening in the preceding example, so let's iterate over all the patterns included here:
- Point(x=0, y=0): This matches if point is an instance of the Point class and its x and y attributes are equal to 0.
- Point(x=0, y=y): This matches if point is an instance of the Point class and its x attribute is equal to 0. The y attribute is captured to the y variable, which can be used within the case block.
- Point(x=x, y=0): This matches if point is an instance of the Point class and its y attribute is equal to 0. The x attribute is captured to the x variable, which can be used within the case block.
- Point(): This matches if point is an instance of the Point class.
- _: This always matches.

As you can see, pattern matching can look deep into object attributes. Despite the Point(x=0, y=0) pattern looking like a constructor call, Python does not call an object constructor when evaluating patterns. It also doesn't inspect arguments and keyword arguments of __init__() methods, so you can access any attribute value in your match pattern.
Match patterns can also use "positional attribute" syntax, but that requires a bit more work. You simply need to provide an additional __match_args__
class attribute that specifies the natural position order of class instance attributes, as in the following example:
class Point:
    x: int
    y: int
    __match_args__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y

def where_is(point):
    match point:
        case Point(0, 0):
            print("Origin")
        case Point(0, y):
            print(f"Y={y}")
        case Point(x, 0):
            print(f"X={x}")
        case Point():
            print("Somewhere else")
        case _:
            print("Not a point")
And that's just the tip of the iceberg. Match statements are actually way more complex than we could demonstrate in this short section. If we were to consider all the potential use cases, syntax variants, and corner cases, we could potentially talk about them throughout the whole chapter. If you want to learn more about them, you should definitely read through the three "canonical" PEPs: 634, 635, and 636.
In this chapter, we've covered the most important language syntax and standard library changes that have happened over the last four versions of Python. If you're not actively following Python release notes or haven't yet transitioned to Python 3.9, this should give you enough information to be up to date.
In this chapter, we've also introduced the concept of programming idioms. This is an idea that we will be referring to multiple times throughout the book. In the next chapter, we will take a closer look at many Python idioms by comparing selected features of Python to different programming languages. If you are a seasoned programmer who has just recently transitioned to Python, this will be a great opportunity to learn the "Python way of doing things." It will also be an opportunity to see where Python really shines, and where it might still be behind the competition.